In-class Exercise 5

Author

Zhu Yiting

Published

December 17, 2022

Objective

In this in-class exercise, we wish to build a logistic regression model for the water point status (functional or non-functional) at Osun state, Nigeria.

Getting started

The code chunk below installs and loads the following R packages:

  • sf

  • tidyverse

  • funModeling

  • blorr - for logistic regression

  • corrplot

  • ggpubr

  • sf

  • spdep

  • GWmodel

  • tmap

  • skimr - to do EDA

  • caret - for error matrix and comparison (of models?)

pacman::p_load(sf, tidyverse, funModeling,
               blorr, corrplot, ggpubr, 
               sf, spdep, GWmodel,
               tmap, skimr, caret)

Importing the Analytical Data

The code chunk below brings the datasets into R.

osun <- read_rds("rds/Osun.rds")
wp <- read_rds("rds/Osun_wp_sf.rds")

osun contains the ADM2 polygon boundaries for Osun state of Nigeria, and wp is the water point data in Osun state of Nigeria.

The rds files have been pre-processed and wrangled (e.g. cleaning up of variables and variable names).

Next, we check the status field of the wp sf data frame object. Note that the data type for this field is logi, i.e. it only takes the values of TRUE or FALSE. This was recoded from the field status_clean of the water point data set, where

  • observations without the status information (NaN) are filtered away,

  • all remaining values that indicates that the water point is functional are recoded as T, and

  • the rest are recoded as F.

wp %>%
  freq(input = "status")

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0

We see that there are 2,642 TRUE values and 2,118 FALSE values.

Bear in mind that linear and logistic regression models do not like missing values - the entire record across all variables would be removed if one or more of the variables have missing values. Hence, it is important to check upfront for missing values and exclude variables which have a significant proportion of missing values (e.g. 20%).

Next, we plot the distribution of the water points by status using tmap, as shown in the code chunk below.

tmap_mode("view")
tm_shape(osun) +
  tmap_options(check.and.fix = TRUE) +
  tm_polygons(alpha = 0.4) +
  tm_shape(wp) +
  tm_dots(col = "status",
          alpha = 0.6) +
  tm_view(set.zoom.limits = c(9, 12))

We set tmap_mode() back to “plot” option after plotting.

tmap_mode("plot")

Exploratory Data Analysis (EDA)

We look at the summary statistics of the water point dataset with skimr for preliminary variable selection, in the following code chunk.

wp %>%
  skim()
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁

For example, we note that 1/4 of install_year are missing values. Hence, despite the variable being useful, we need to drop this variable from the logistic regression model building.

Based on the EDA, we will use the following variables for our model building, in the code chunk below.

wp_clean <- wp %>%
  filter_at(vars(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean),
            all_vars(!is.na(.))) %>%
  mutate(usage_capacity = as.factor(usage_capacity))

What we have done above are to:

  • exclude missing values (filtering for all_vars(!is.na(.))); and

  • recode usage_capacity as factor (it only has 3 classes) instead of numerical data type. This is because the calibration of logit function will be different.

We also check the wp_clean object in the Environment panel. We see that the number of observations dropped by 4 from 4,760 to 4,756, which signifies that the 4 missing values from the water_point_population and local_population_1km fields are successfully removed. We also check that the status field has been correctly recoded to Factor w/ 2 levels "300", "1000".

Correlation Analysis

We first extract the desired variables from wp_clean into a new object osun_wp, and remove the geometry column by setting st_set_geometry() to NULL so that we can do a correlation matrix plot.

osun_wp <- wp_clean %>%
  select(c(7, 35:39, 42:43, 46:47, 57)) %>%
  st_set_geometry(NULL)

Next, we plot the correlation matrix for all the numerical data fields.

cluster_vars.cor = cor(
  osun_wp[,2:7])
corrplot.mixed(cluster_vars.cor,
               lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

We see that none of the variables are highly correlated with any other variable (r no more than 0.85). Hence, we will keep all the variables for logistic regression model building in the next section.

Building a Logistic Regression Model

In the code chunk below, glm() of R stats is used to calibrate a logistic regression for the water point status.

model <- glm(status ~ distance_to_primary_road +
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = wp_clean,
             family = binomial(link = "logit"))

Instead of using a typical R report, we use blr_regress() of blorr to generate the model report in scientific literature reporting format.

blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4744           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3887        0.1124      3.4588       5e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7153      0.4744 
       distance_to_secondary_road           1      0.0000        0.0000     -0.5530      0.5802 
       distance_to_tertiary_road            1      1e-04         0.0000      4.6708      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7574      0.0000 
            distance_to_town                1      0.0000        0.0000     -4.9170      0.0000 
              is_urbanTRUE                  1     -0.2971        0.0819     -3.6294       3e-04 
           usage_capacity1000               1     -0.6230        0.0697     -8.9366      0.0000 
water_source_cleanProtected Shallow Well    1      0.5040        0.0857      5.8783      0.0000 
   water_source_cleanProtected Spring       1      1.2882        0.4388      2.9359      0.0033 
         water_point_population             1      -5e-04        0.0000    -11.3686      0.0000 
          local_population_1km              1      3e-04         0.0000     19.2953      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7347          Somers' D        0.4693   
% Discordant          0.2653          Gamma            0.4693   
% Tied                0.0000          Tau-a            0.2318   
Pairs                5585188          c                0.7347   
---------------------------------------------------------------

At 95% confidence level, variables with p-value less than 0.05 are statistically significant. These are all independent variables except distance_to_primary_road and distance_to_secondary_road.

For interpretation of logistic regression report:

  • Categorical variables: A positive value implies an above average correlation and a negative value implies a below average correlation, while the magnitude of the coefficient does not matter for categorical variables;

  • Continuous variables: a positive value implies a direct correlation and a negative value implies an inverse correlation, while the magnitude of the value gives the strength of the correlation.

We generate the confusion matrix for the model using blr_confusion_matrix() of blorr.

blr_confusion_matrix(model, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1301  738
         1   813 1904

                Accuracy : 0.6739 
     No Information Rate : 0.4445 

                   Kappa : 0.3373 

McNemars's Test P-Value  : 0.0602 

             Sensitivity : 0.7207 
             Specificity : 0.6154 
          Pos Pred Value : 0.7008 
          Neg Pred Value : 0.6381 
              Prevalence : 0.5555 
          Detection Rate : 0.4003 
    Detection Prevalence : 0.5713 
       Balanced Accuracy : 0.6680 
               Precision : 0.7008 
                  Recall : 0.7207 

        'Positive' Class : 1

The validity of a cut-off (here we use 0.5) is measured using sensitivity, specificity and accuracy.

  • Sensitivity: The % of correctly classified positive events out of all predicted positive events = TP / (TP + FN).

  • Specificity: The % of correctly classified negative events out of all predicted negative events = TN / (TN + FP).

  • Accuracy: The % of correctly classified events out of all events = (TP + TN) / (TP + FP + TN + FN).

  • Note that TP = true positve, TN = true negative, FP = false positive and FN = false negative.

From the output, we see that the model gives us an accuracy of 0.6739, which is a good start as it is better than guessing (0.5).

The sensitivity and specificity are 0.7207 and 0.6154 respectively. This shows that the true positives (functional water points) are slightly higher than the true negative prediction rates (non-functional water points).

Building Geographically Weighted Logistic Regression (gwLR) Model

Converting from sf to sp Data Frame

Next, we need to convert the sf data frame into spatial point data frame for GWR model building. This is done using the code chunk below.

wp_sp <- wp_clean %>%
  select(c(status,
           distance_to_primary_road,
           distance_to_secondary_road,
           distance_to_tertiary_road,
           distance_to_city,
           distance_to_town,
           water_point_population,
           local_population_1km,
           is_urban,
           usage_capacity,
           water_source_clean)) %>%
  as_Spatial()
wp_sp
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 11
names       : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean 
min values  :      0,        0.014461356813335,          0.152195902540837,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,        0,           1000,           Borehole 
max values  :      1,         26909.8616132094,           19559.4793799085,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,        1,            300,   Protected Spring 

Note that we use the cleaned version of the water point sf data frame for consistency in the geometrics with our model building (4 water points with missing values excluded).

Building Fixed Bandwidth GWR Model

bw.fixed <- bw.ggwr(status ~ distance_to_primary_road +
                      distance_to_secondary_road +
                      distance_to_tertiary_road +
                      distance_to_city +
                      distance_to_town +
                      is_urban +
                      usage_capacity +
                      water_source_clean +
                      water_point_population +
                      local_population_1km,
                      data = wp_sp,
                    family = "binomial",
                    approach  = "AIC",
                    kernel = "gaussian",
                    adaptive = FALSE, # for fixed bandwidth
                    longlat = FALSE) # input data have been converted to projected CRS

Model Assessment

bw.fixed

[1] 2599.672

We get the above output. We feed it into the bw argument in ggwr.basic() of GWmodel in the code chunk below.

gwlr.fixed <- ggwr.basic(status ~ distance_to_primary_road +
                           distance_to_secondary_road +
                           distance_to_tertiary_road +
                           distance_to_city +
                           distance_to_town +
                           is_urban +
                           usage_capacity +
                           water_source_clean +
                           water_point_population +
                           local_population_1km,
                      data = wp_sp,
                      bw = 2599.672,
                      family = "binomial",
                      kernel = "gaussian",
                      adaptive = FALSE,
                      longlat = FALSE)
 Iteration    Log-Likelihood
=========================
       0        -1958 
       1        -1676 
       2        -1526 
       3        -1443 
       4        -1405 
       5        -1405 

We look at the results below. Similar to when we build multiple linear regression model, the report has 2 sections - generalised regression (global model) results and geographically weighted (GW) regression results. Note that the global model does not have AICc result, so AIC should be used to compare the 2 models.

gwlr.fixed
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-17 16:40:22 
   Call:
   ggwr.basic(formula = status ~ distance_to_primary_road + distance_to_secondary_road + 
    distance_to_tertiary_road + distance_to_city + distance_to_town + 
    is_urban + usage_capacity + water_source_clean + water_point_population + 
    local_population_1km, data = wp_sp, bw = 2599.672, family = "binomial", 
    kernel = "gaussian", adaptive = FALSE, longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road distance_to_city distance_to_town is_urban usage_capacity water_source_clean water_point_population local_population_1km
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-124.555    -1.755     1.072     1.742    34.333  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 3.887e-01  1.124e-01   3.459 0.000543
distance_to_primary_road                 -4.642e-06  6.490e-06  -0.715 0.474422
distance_to_secondary_road               -5.143e-06  9.299e-06  -0.553 0.580230
distance_to_tertiary_road                 9.683e-05  2.073e-05   4.671 3.00e-06
distance_to_city                         -1.686e-05  3.544e-06  -4.757 1.96e-06
distance_to_town                         -1.480e-05  3.009e-06  -4.917 8.79e-07
is_urbanTRUE                             -2.971e-01  8.185e-02  -3.629 0.000284
usage_capacity1000                       -6.230e-01  6.972e-02  -8.937  < 2e-16
water_source_cleanProtected Shallow Well  5.040e-01  8.574e-02   5.878 4.14e-09
water_source_cleanProtected Spring        1.288e+00  4.388e-01   2.936 0.003325
water_point_population                   -5.097e-04  4.484e-05 -11.369  < 2e-16
local_population_1km                      3.451e-04  1.788e-05  19.295  < 2e-16
                                            
Intercept                                ***
distance_to_primary_road                    
distance_to_secondary_road                  
distance_to_tertiary_road                ***
distance_to_city                         ***
distance_to_town                         ***
is_urbanTRUE                             ***
usage_capacity1000                       ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
water_point_population                   ***
local_population_1km                     ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.0  on 4744  degrees of freedom
AIC: 5712

Number of Fisher Scoring iterations: 5


 AICc:  5712.099
 Pseudo R-square value:  0.1295351
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2599.672 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -8.7229e+02 -4.9955e+00  1.7600e+00
   distance_to_primary_road                 -1.9389e-02 -4.8031e-04  2.9618e-05
   distance_to_secondary_road               -1.5921e-02 -3.7551e-04  1.2317e-04
   distance_to_tertiary_road                -1.5618e-02 -4.2368e-04  7.6179e-05
   distance_to_city                         -1.8416e-02 -5.6217e-04 -1.2726e-04
   distance_to_town                         -2.2411e-02 -5.7283e-04 -1.5155e-04
   is_urbanTRUE                             -1.9790e+02 -4.2908e+00 -1.6864e+00
   usage_capacity1000                       -2.0772e+01 -9.7231e-01 -4.1592e-01
   water_source_cleanProtected.Shallow.Well -2.0789e+01 -4.5190e-01  5.3340e-01
   water_source_cleanProtected.Spring       -5.2235e+02 -5.5977e+00  2.5441e+00
   water_point_population                   -5.2208e-02 -2.2767e-03 -9.8875e-04
   local_population_1km                     -1.2698e-01  4.9952e-04  1.0638e-03
                                                3rd Qu.      Max.
   Intercept                                 1.2763e+01 1073.2156
   distance_to_primary_road                  4.8443e-04    0.0142
   distance_to_secondary_road                6.0692e-04    0.0258
   distance_to_tertiary_road                 6.6815e-04    0.0128
   distance_to_city                          2.3718e-04    0.0150
   distance_to_town                          1.9271e-04    0.0224
   is_urbanTRUE                              1.2841e+00  744.3099
   usage_capacity1000                        3.0322e-01    5.9281
   water_source_cleanProtected.Shallow.Well  1.7849e+00   67.6343
   water_source_cleanProtected.Spring        6.7663e+00  317.4133
   water_point_population                    5.0102e-04    0.1309
   local_population_1km                      1.8157e-03    0.0392
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 2795.084 
   AIC : 4414.606 
   AICc : 4747.423 
   Pseudo R-square value:  0.5722559 

   ***********************************************************************
   Program stops at: 2022-12-17 16:40:54 

Comparing the AIC values of the 2 models, we see that it is lower for the GW regression model at 4,414.606 then for the global regression model at 5,712.

Converting SDF into sf Data Frame

To assess the performance of the gwLR, we will first convert the SDF object in as data frame by using te code chunk below.

gwr.fixed <- as.data.frame(gwlr.fixed$SDF)

Next, we will label yhat values (probability values for functional water points) greater or equal to 0.5 as 1 and otherwise 0. The result of the logic comparison operation will be saved into a field called most.

gwr.fixed <- gwr.fixed %>%
  mutate(most = ifelse(
    gwr.fixed$yhat >= 0.5, T, F))

Confusion Matrix

Next, we use confusionMatrix() of caret to display the confusion matrix of the GW model using fixed bandwidth method.

gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data = gwr.fixed$most, # predicted outcome
                      reference = gwr.fixed$y, # reference y
                      positive = "TRUE") # setting positive class to "TRUE"
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1824  263
     TRUE    290 2379
                                          
               Accuracy : 0.8837          
                 95% CI : (0.8743, 0.8927)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7642          
                                          
 Mcnemar's Test P-Value : 0.2689          
                                          
            Sensitivity : 0.9005          
            Specificity : 0.8628          
         Pos Pred Value : 0.8913          
         Neg Pred Value : 0.8740          
             Prevalence : 0.5555          
         Detection Rate : 0.5002          
   Detection Prevalence : 0.5612          
      Balanced Accuracy : 0.8816          
                                          
       'Positive' Class : TRUE            
                                          

We see that the accuracy (0.8837 vs 0.6739), sensitivity (0.9005 vs 0.7207) and specificity (0.8628 vs 0.6154) values have all improved from the non-gwLR global model. By using the gwLR model, we can explain the functional and non-functional water points better now which allows better management of water points through localised strategies (e.g. look at the local neighbourhood regions within Osun state).

Visualising gwLR

Before we visualise the results of the gwLR model, we clean up the data set for plotting by selecting the relevant data fields (mainly the status column which is the dependent or predicted variable) into a new sf data frame object wp_sf_select in the code chunk below.

wp_sf_select <- wp_clean %>%
  select(c(ADM2_EN, ADM2_PCODE,
           ADM1_EN, ADM1_PCODE,
           status))

We then combine it with gwr.fixed which has the predicted values of the water point status, in the form of probabilities between 0 and 1.

gwr_sf.fixed <- cbind(wp_sf_select, gwr.fixed)

The code chunk below is used to create an interactive point symbol map.

tmap_mode("view")

actual <- tm_shape(osun) +
  tmap_options(check.and.fix = TRUE) +
  tm_polygons(alpha = 0.4) +
  tm_shape(wp) +
  tm_dots(col = "status",
          alpha = 0.6,
          palette = "YlOrRd") +
  tm_view(set.zoom.limits = c(9, 12))

prob_T <- tm_shape(osun) +
  tm_polygons(alpha = 0.4) +
  tm_shape(gwr_sf.fixed) + 
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(9, 12))

tmap_arrange(actual, prob_T, 
             asp = 1, ncol = 2, sync = TRUE)

We see that the predictions are largely aligned with the actual status of the water points, in line with the 88% accuracy rate.

Employing Only Statistically Significant Variables in Global and gwLR Models

As we earlier saw that 2 of the 10 variables, distance_to_primary_road and distance_to_secondary_road, are not statistically significant (p-values > 0.05), we should build logistic regression models without these 2 variables.

Hence, we repeat the relevant steps above to replicate the model building, assessment and visualisation process in the following code chunks, starting with constructing the model with only the 8 statistically significant variables.

Building Global Model

model_refined <- glm(status ~ distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = wp_clean,
             family = binomial(link = "logit"))

blr_regress(model_refined)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4746           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3540        0.1055      3.3541       8e-04 
       distance_to_tertiary_road            1      1e-04         0.0000      4.9096      0.0000 
            distance_to_city                1      0.0000        0.0000     -5.2022      0.0000 
            distance_to_town                1      0.0000        0.0000     -5.4660      0.0000 
              is_urbanTRUE                  1     -0.2667        0.0747     -3.5690       4e-04 
           usage_capacity1000               1     -0.6206        0.0697     -8.9081      0.0000 
water_source_cleanProtected Shallow Well    1      0.4947        0.0850      5.8228      0.0000 
   water_source_cleanProtected Spring       1      1.2790        0.4384      2.9174      0.0035 
         water_point_population             1      -5e-04        0.0000    -11.3902      0.0000 
          local_population_1km              1      3e-04         0.0000     19.4069      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7349          Somers' D        0.4697   
% Discordant          0.2651          Gamma            0.4697   
% Tied                0.0000          Tau-a            0.2320   
Pairs                5585188          c                0.7349   
---------------------------------------------------------------

We check and see that the remaining variables are all statistically significant to the linear regression model (p-values < 0.05).

Confusion Matrix for Global Model

The code chunk below calculates and displays the confusion matrix for the refined model. We will discuss the results together with that for the refined gwLR model in the subsequent subsection.

blr_confusion_matrix(model_refined, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1300  743
         1   814 1899

                Accuracy : 0.6726 
     No Information Rate : 0.4445 

                   Kappa : 0.3348 

McNemars's Test P-Value  : 0.0761 

             Sensitivity : 0.7188 
             Specificity : 0.6149 
          Pos Pred Value : 0.7000 
          Neg Pred Value : 0.6363 
              Prevalence : 0.5555 
          Detection Rate : 0.3993 
    Detection Prevalence : 0.5704 
       Balanced Accuracy : 0.6669 
               Precision : 0.7000 
                  Recall : 0.7188 

        'Positive' Class : 1

Determining Fixed Bandwidth for GWR Model

bw.fixed_refined <- bw.ggwr(status ~ distance_to_tertiary_road +
                      distance_to_city +
                      distance_to_town +
                      is_urban +
                      usage_capacity +
                      water_source_clean +
                      water_point_population +
                      local_population_1km,
                      data = wp_sp,
                    family = "binomial",
                    approach  = "AIC",
                    kernel = "gaussian",
                    adaptive = FALSE, # for fixed bandwidth
                    longlat = FALSE) # input data have been converted to projected CRS
bw.fixed_refined

[1] 2377.371

The output for bw.fixed_refined is given above. We will use this optimal fixed distance value for model assessment in the next subsection.

Model Assessment

gwlr.fixed_refined <- ggwr.basic(status ~ distance_to_tertiary_road +
                           distance_to_city +
                           distance_to_town +
                           is_urban +
                           usage_capacity +
                           water_source_clean +
                           water_point_population +
                           local_population_1km,
                      data = wp_sp,
                      bw = 2377.371,
                      family = "binomial",
                      kernel = "gaussian",
                      adaptive = FALSE,
                      longlat = FALSE)
 Iteration    Log-Likelihood
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
gwlr.fixed_refined
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-17 16:40:56 
   Call:
   ggwr.basic(formula = status ~ distance_to_tertiary_road + distance_to_city + 
    distance_to_town + is_urban + usage_capacity + water_source_clean + 
    water_point_population + local_population_1km, data = wp_sp, 
    bw = 2377.371, family = "binomial", kernel = "gaussian", 
    adaptive = FALSE, longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_tertiary_road distance_to_city distance_to_town is_urban usage_capacity water_source_clean water_point_population local_population_1km
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-129.368    -1.750     1.074     1.742    34.126  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 3.540e-01  1.055e-01   3.354 0.000796
distance_to_tertiary_road                 1.001e-04  2.040e-05   4.910 9.13e-07
distance_to_city                         -1.764e-05  3.391e-06  -5.202 1.97e-07
distance_to_town                         -1.544e-05  2.825e-06  -5.466 4.60e-08
is_urbanTRUE                             -2.667e-01  7.474e-02  -3.569 0.000358
usage_capacity1000                       -6.206e-01  6.966e-02  -8.908  < 2e-16
water_source_cleanProtected Shallow Well  4.947e-01  8.496e-02   5.823 5.79e-09
water_source_cleanProtected Spring        1.279e+00  4.384e-01   2.917 0.003530
water_point_population                   -5.098e-04  4.476e-05 -11.390  < 2e-16
local_population_1km                      3.452e-04  1.779e-05  19.407  < 2e-16
                                            
Intercept                                ***
distance_to_tertiary_road                ***
distance_to_city                         ***
distance_to_town                         ***
is_urbanTRUE                             ***
usage_capacity1000                       ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
water_point_population                   ***
local_population_1km                     ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.9  on 4746  degrees of freedom
AIC: 5708.9

Number of Fisher Scoring iterations: 5


 AICc:  5708.923
 Pseudo R-square value:  0.129406
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2377.371 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -3.7021e+02 -4.3797e+00  3.5590e+00
   distance_to_tertiary_road                -3.1622e-02 -4.5462e-04  9.1291e-05
   distance_to_city                         -5.4555e-02 -6.5623e-04 -1.3507e-04
   distance_to_town                         -8.6549e-03 -5.2754e-04 -1.6785e-04
   is_urbanTRUE                             -7.3554e+02 -3.4675e+00 -1.6596e+00
   usage_capacity1000                       -5.5889e+01 -1.0347e+00 -4.1960e-01
   water_source_cleanProtected.Shallow.Well -1.8842e+02 -4.7295e-01  6.2378e-01
   water_source_cleanProtected.Spring       -1.3630e+03 -5.3436e+00  2.7714e+00
   water_point_population                   -2.9696e-02 -2.2705e-03 -1.2277e-03
   local_population_1km                     -7.7730e-02  4.4281e-04  1.0548e-03
                                                3rd Qu.      Max.
   Intercept                                 1.3755e+01 2171.6375
   distance_to_tertiary_road                 6.3011e-04    0.0237
   distance_to_city                          1.5921e-04    0.0162
   distance_to_town                          2.4490e-04    0.0179
   is_urbanTRUE                              1.0554e+00  995.1841
   usage_capacity1000                        3.9113e-01    9.2449
   water_source_cleanProtected.Shallow.Well  1.9564e+00   66.8914
   water_source_cleanProtected.Spring        7.0805e+00  208.3749
   water_point_population                    4.5879e-04    0.0765
   local_population_1km                      1.8479e-03    0.0333
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 2815.659 
   AIC : 4418.776 
   AICc : 4744.213 
   Pseudo R-square value:  0.5691072 

   ***********************************************************************
   Program stops at: 2022-12-17 16:41:23 

The AIC values of the 4 models are summarised in the table below.

AIC values Global model gwLR
All variables 5,712 4,414.606
Statistically significant variables only 5,708.9 4,418.776

We see that both gwLR models have lower AIC values than their global model counter parts. Between the 2 gwLR models, the one with all 10 variables have a slightly lower AIC value (4,414.606) than the one with only the 8 statistically significant variables (4,418.776).

Converting SDF into sf Data Frame

gwr.fixed_refined <- as.data.frame(gwlr.fixed_refined$SDF)

Assigning Cutoff Value of 0.5

gwr.fixed_refined <- gwr.fixed_refined %>%
  mutate(most = ifelse(
    gwr.fixed_refined$yhat >= 0.5, T, F))

Confusion Matrix for gwLR

We similarly call the confusion matrix and statistics using confusionMatrix() of caret in the code chunk below.

gwr.fixed_refined$y <- as.factor(gwr.fixed_refined$y)
gwr.fixed_refined$most <- as.factor(gwr.fixed_refined$most)
CM_refined <- confusionMatrix(data = gwr.fixed_refined$most,
                      reference = gwr.fixed_refined$y,
                      positive = "TRUE")
CM_refined
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1833  268
     TRUE    281 2374
                                          
               Accuracy : 0.8846          
                 95% CI : (0.8751, 0.8935)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7661          
                                          
 Mcnemar's Test P-Value : 0.6085          
                                          
            Sensitivity : 0.8986          
            Specificity : 0.8671          
         Pos Pred Value : 0.8942          
         Neg Pred Value : 0.8724          
             Prevalence : 0.5555          
         Detection Rate : 0.4992          
   Detection Prevalence : 0.5582          
      Balanced Accuracy : 0.8828          
                                          
       'Positive' Class : TRUE            
                                          

We see that the accuracy (0.8837 vs 0.6739), sensitivity (0.8628 vs 0.7207) and specificity (0.9005 vs 0.6154) values have all improved from the non-gwLR global model. By using the gwLR model, we can explain the non-functional water points better now which allows better management of water points through localised strategies (e.g. look at the local neighbourhood regions within Osun state).

The performance measures of the 4 logistic regression models are summarised in the table below.

Performance Measure Global regression with 10 variables gwLR with 10 variables Global regression with 8 variables gwLR with 8 variables
Accuracy 0.6739 0.8837 0.6726 0.8846
Sensitivity 0.7207 0.9005 0.7188 0.8986
Specificity 0.6154 0.8628 0.6149 0.8671

We see that the model accuracy and specificity improve very slightly by removing the non-statistically significant variables from the gwLR model, but the sensitivity drops slightly. Nevertheless, as we would be more interested in finding non-functional water points for maintenance etc., the gwLR model with 8 variables would be more useful with a higher specificity.

Visualising gwLR

Now we combined the prediction from the refined gwLR model with the water point sf data frame with selected variables for visualisation.

gwr_sf.fixed_refined <- cbind(wp_sf_select, gwr.fixed_refined)

We can similarly visualise the actual versus predicted functional (more red) and non-functional (more yellow) water points using the code chunk below.

tmap_mode("view")

actual <- tm_shape(osun) +
  tmap_options(check.and.fix = TRUE) +
  tm_polygons(alpha = 0.4) +
  tm_shape(wp) +
  tm_dots(col = "status",
          alpha = 0.6,
          palette = "YlOrRd") +
  tm_view(set.zoom.limits = c(9, 12))

prob_T_refined <- tm_shape(osun) +
  tm_polygons(alpha = 0.4) +
  tm_shape(gwr_sf.fixed_refined) + 
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(9, 12))

tmap_arrange(actual, prob_T_refined, 
             asp = 1, ncol = 2, sync = TRUE)

We see that the predictions are largely aligned with the actual status of the water points, in line with the 88% accuracy rate.